AITopics | statistical machine translation

Collaborating Authors

statistical machine translation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

A New NMT Model for Translating Clinical Texts from English to Spanish

Li, Rumeng, Wang, Xun, Yu, Hong

arXiv.org Artificial IntelligenceAug-27-2025

Translating electronic health record (EHR) narratives from English to Spanish is a clinically important yet challenging task due to the lack of a parallel-aligned corpus and the abundant unknown words contained. To address such challenges, we propose \textbf{NOOV} (for No OOV), a new neural machine translation (NMT) system that requires little in-domain parallel-aligned corpus for training. NOOV integrates a bilingual lexicon automatically learned from parallel-aligned corpora and a phrase look-up table extracted from a large biomedical knowledge resource, to alleviate both the unknown word problem and the word-repeat challenge in NMT, enhancing better phrase generation of NMT systems. Evaluation shows that NOOV is able to generate better translation of EHR with improvement in both accuracy and fluency.

artificial intelligence, natural language, translation, (17 more...)

arXiv.org Artificial Intelligence

2508.18607

Country: North America > United States > Massachusetts (0.94)

Genre: Research Report (1.00)

Industry:

Health & Medicine > Therapeutic Area (1.00)
Health & Medicine > Health Care Technology > Medical Record (0.69)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

KAConvText: Novel Approach to Burmese Sentence Classification using Kolmogorov-Arnold Convolution

Thu, Ye Kyaw, Aung, Thura, Oo, Thazin Myint, Supnithi, Thepchai

arXiv.org Artificial IntelligenceJul-10-2025

This paper presents the first application of Kolmogorov-Arnold Convolution for Text (KAConvText) in sentence classification, addressing three tasks: imbalanced binary hate speech detection, balanced multiclass news classification, and imbalanced multiclass ethnic language identification. We investigate various embedding configurations, comparing random to fastText embeddings in both static and fine-tuned settings, with embedding dimensions of 100 and 300 using CBOW and Skip-gram models. Baselines include standard CNNs and CNNs augmented with a Kolmogorov-Arnold Network (CNN-KAN). In addition, we investigated KAConvText with different classification heads - MLP and KAN, where using KAN head supports enhanced interpretability. Results show that KAConvText-MLP with fine-tuned fastText embeddings achieves the best performance of 91.23% accuracy (F1-score = 0.9109) for hate speech detection, 92.66% accuracy (F1-score = 0.9267) for news classification, and 99.82% accuracy (F1-score = 0.9982) for language identification.

f1-score, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2507.06753

Country: Asia > Myanmar (0.16)

Genre:

Research Report > New Finding (0.48)
Research Report > Promising Solution (0.40)
Overview > Innovation (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.71)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.69)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.48)

Add feedback

Two Spelling Normalization Approaches Based on Large Language Models

Domingo, Miguel, Casacuberta, Francisco

arXiv.org Artificial IntelligenceJul-1-2025

The absence of standardized spelling conventions and the organic evolution of human language present an inherent linguistic challenge within historical documents, a longstanding concern for scholars in the humanities. Addressing this issue, spelling normalization endeavors to align a document's orthography with contemporary standards. In this study, we propose two new approaches based on large language models: one of which has been trained without a supervised training, and a second one which has been trained for machine translation. Our evaluation spans multiple datasets encompassing diverse languages and historical periods, leading us to the conclusion that while both of them yielded encouraging results, statistical machine translation still seems to be the most suitable technology for this task.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2506.23288

Country:

Europe > Spain (0.04)
North America > United States > Indiana (0.04)

Genre: Research Report > New Finding (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Add feedback

Comparative analysis of subword tokenization approaches for Indian languages

Das, Sudhansu Bala, Choudhury, Samujjal, Mishra, Tapas Kumar, Patra, Bidyut Kr.

arXiv.org Artificial IntelligenceMay-23-2025

Tokenization is the act of breaking down text into smaller parts, or tokens, that are easier for machines to process. This is a key phase in machine translation (MT) models. Subword tokenization enhances this process by breaking down words into smaller subword units, which is especially beneficial in languages with complicated morphology or a vast vocabulary. It is useful in capturing the intricate structure of words in Indian languages (ILs), such as prefixes, suffixes, and other morphological variations. These languages frequently use agglutinative structures, in which words are formed by the combination of multiple morphemes such as suffixes, prefixes, and stems. As a result, a suitable tokenization strategy must be chosen to address these scenarios. This paper examines how different subword tokenization techniques, such as SentencePiece, Byte Pair Encoding (BPE), and WordPiece Tokenization, affect ILs. The effectiveness of these subword tokenization techniques is investigated in statistical, neural, and multilingual neural machine translation models. All models are examined using standard evaluation metrics, such as the Bilingual Evaluation Understudy (BLEU) score, TER, METEOR, CHRF, RIBES, and COMET. Based on the results, it appears that for the majority of language pairs for the Statistical and Neural MT models, the SentencePiece tokenizer continuously performed better than other tokenizers in terms of BLEU score. However, BPE tokenization outperformed other tokenization techniques in the context of Multilingual Neural Machine Translation model. The results show that, despite using the same tokenizer and dataset for each model, translations from ILs to English surpassed translations from English to ILs.

artificial intelligence, natural language, translation, (15 more...)

arXiv.org Artificial Intelligence

2505.16868

Country: Asia > India (0.28)

Genre:

Research Report (1.00)
Overview (0.87)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Context-aware Stand-alone Neural Spelling Correction

Li, Xiangci, Liu, Hairong, Huang, Liang

arXiv.org Artificial IntelligenceMay-19-2025

Existing natural language processing systems are vulnerable to noisy inputs resulting from misspellings. On the contrary, humans can easily infer the corresponding correct words from their misspellings and surrounding context. Inspired by this, we address the stand-alone spelling correction problem, which only corrects the spelling of each token without additional token insertion or deletion, by utilizing both spelling information and global context representations. We present a simple yet powerful solution that jointly detects and corrects misspellings as a sequence labeling task by fine-turning a pre-trained language model. Our solution outperforms the previous state-of-the-art result by 12.8% absolute F0.5 score.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2011.06642

Country: North America > United States (0.47)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.94)

Add feedback

Domain-Specific Machine Translation to Translate Medicine Brochures in English to Sorani Kurdish

Shamal, Mariam, Hassani, Hossein

arXiv.org Artificial IntelligenceJan-23-2025

Access to Kurdish medicine brochures is limited, depriving Kurdish-speaking communities of critical health information. To address this problem, we developed a specialized Machine Translation (MT) model to translate English medicine brochures into Sorani Kurdish using a parallel corpus of 22,940 aligned sentence pairs from 319 brochures, sourced from two pharmaceutical companies in the Kurdistan Region of Iraq (KRI). We trained a Statistical Machine Translation (SMT) model using the Moses toolkit, conducting seven experiments that resulted in BLEU scores ranging from 22.65 to 48.93. We translated three new brochures to improve the evaluation process and encountered unknown words. We addressed unknown words through post-processing with a medical dictionary, resulting in BLEU scores of 56.87, 31.05, and 40.01. Human evaluation by native Kurdish-speaking pharmacists, physicians, and medicine users showed that 50% of professionals found the translations consistent, while 83.3% rated them accurate. Among users, 66.7% considered the translations clear and felt confident using the medications.

artificial intelligence, brochure, natural language, (14 more...)

arXiv.org Artificial Intelligence

2501.13609

Country:

Europe > Middle East (0.04)
Europe > Greece (0.04)
Europe > France > Occitanie > Hérault > Montpellier (0.04)
(4 more...)

Genre: Research Report (1.00)

Industry: Health & Medicine > Pharmaceuticals & Biotechnology (0.78)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)

Add feedback

Artificial intelligence contribution to translation industry: looking back and forward

Shormani, Mohammed Q.

arXiv.org Artificial IntelligenceDec-2-2024

This study provides a comprehensive analysis of artificial intelligence (AI) contribution to translation industry (ACTI) research, synthesizing it over forty-one years from 1980-2024. 13220 articles were retrieved from three sources, namely WoS, Scopus, and Lens. We provided two types of analysis, viz., scientometric and thematic, focusing on cluster, subject categories, keywords, burstness, centrality and research centers as for the former. For the latter, we thematically review 18 articles, selected purposefully from the articles involved, centering on purpose, approach, findings, and contribution to ACTI future directions. The findings reveal that in the past AI contribution to translation industry was not rigorous, resulting in rule-based machine translation and statistical machine translation whose output was not satisfactory. However, the more AI develops, the more machine translation develops, incorporating Neural Networking Algorithms and (Deep) Language Learning Models like ChatGPT whose translation output has developed considerably. However, much rigorous research is still needed to overcome several problems encountering translation industry, specifically concerning low-source languages, multi-dialectical and free word order languages, and cultural and religious registers.

acti research, machine translation, translation, (13 more...)

arXiv.org Artificial Intelligence

2411.19855

Country:

South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
South America > Argentina > Patagonia > Río Negro Province > Viedma (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
(7 more...)

Genre:

Research Report (1.00)
Overview (1.00)

Industry: Education (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy

Thompson, Brian, Mathur, Nitika, Deutsch, Daniel, Khayrallah, Huda

arXiv.org Artificial IntelligenceSep-14-2024

Selecting an automatic metric that best emulates human judgments is often non-trivial, because there is no clear definition of "best emulates." A meta-metric is required to compare the human judgments to the automatic metric judgments, and metric rankings depend on the choice of meta-metric. We propose Soft Pairwise Accuracy (SPA), a new meta-metric that builds on Pairwise Accuracy (PA) but incorporates the statistical significance of both the human judgments and the metric judgments. SPA allows for more fine-grained comparisons between systems than a simplistic binary win/loss, and addresses a number of shortcomings with PA: it is more stable with respect to both the number of systems and segments used for evaluation, it mitigates the issue of metric ties due to quantization, and it produces more statistically significant results. SPA was selected as the official system-level metric for the 2024 WMT metric shared task.

computational linguistic, machine translation, metric, (12 more...)

arXiv.org Artificial Intelligence

2409.09598

Country:

Asia > Singapore (0.05)
Europe > Portugal > Lisbon > Lisbon (0.04)
Europe > Bulgaria > Sofia City Province > Sofia (0.04)
(20 more...)

Genre: Research Report > Experimental Study (1.00)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (0.68)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.50)

Add feedback

ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary

Li, Yutong, Chen, Lu, Liu, Aiwei, Yu, Kai, Wen, Lijie

arXiv.org Artificial IntelligenceMar-4-2024

The literature review is an indispensable step in the research process. It provides the benefit of comprehending the research problem and understanding the current research situation while conducting a comparative analysis of prior works. However, literature summary is challenging and time consuming. The previous LLM-based studies on literature review mainly focused on the complete process, including literature retrieval, screening, and summarization. However, for the summarization step, simple CoT method often lacks the ability to provide extensive comparative summary. In this work, we firstly focus on the independent literature summarization step and introduce ChatCite, an LLM agent with human workflow guidance for comparative literature summary. This agent, by mimicking the human workflow, first extracts key elements from relevant literature and then generates summaries using a Reflective Incremental Mechanism. In order to better evaluate the quality of the generated summaries, we devised a LLM-based automatic evaluation metric, G-Score, in refer to the human evaluation criteria. The ChatCite agent outperformed other models in various dimensions in the experiments. The literature summaries generated by ChatCite can also be directly used for drafting literature reviews.

literature summary, machine translation, translation, (15 more...)

arXiv.org Artificial Intelligence

2403.02574

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Beijing > Beijing (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

IndicTrans2: Towards High-Quality and Accessible Machine Translation Models for all 22 Scheduled Indian Languages

Gala, Jay, Chitale, Pranjal A., AK, Raghavan, Gumma, Varun, Doddapaneni, Sumanth, Kumar, Aswanth, Nawale, Janki, Sujatha, Anupama, Puduppully, Ratish, Raghavan, Vivek, Kumar, Pratyush, Khapra, Mitesh M., Dabre, Raj, Kunchukuttan, Anoop

arXiv.org Artificial IntelligenceDec-20-2023

India has a rich linguistic landscape with languages from 4 major language families spoken by over a billion people. 22 of these languages are listed in the Constitution of India (referred to as scheduled languages) are the focus of this work. Given the linguistic diversity, high-quality and accessible Machine Translation (MT) systems are essential in a country like India. Prior to this work, there was (i) no parallel training data spanning all 22 languages, (ii) no robust benchmarks covering all these languages and containing content relevant to India, and (iii) no existing translation models which support all the 22 scheduled languages of India. In this work, we aim to address this gap by focusing on the missing pieces required for enabling wide, easy, and open access to good machine translation systems for all 22 scheduled Indian languages. We identify four key areas of improvement: curating and creating larger training datasets, creating diverse and high-quality benchmarks, training multilingual models, and releasing models with open access. Our first contribution is the release of the Bharat Parallel Corpus Collection (BPCC), the largest publicly available parallel corpora for Indic languages. BPCC contains a total of 230M bitext pairs, of which a total of 126M were newly added, including 644K manually translated sentence pairs created as part of this work. Our second contribution is the release of the first n-way parallel benchmark covering all 22 Indian languages, featuring diverse domains, Indian-origin content, and source-original test sets. Next, we present IndicTrans2, the first model to support all 22 languages, surpassing existing models on multiple existing and new benchmarks created as a part of this work. Lastly, to promote accessibility and collaboration, we release our models and associated data with permissive licenses at https://github.com/AI4Bharat/IndicTrans2.

bharat parallel corpus collection, european language resource association, indictrans2-m2m model, (12 more...)

arXiv.org Artificial Intelligence

2305.16307

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.27)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.13)
Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
(45 more...)

Genre:

Research Report > New Finding (1.00)
Overview (1.00)

Industry:

Law (1.00)
Education (1.00)
Government > Regional Government > Asia Government > India Government (0.66)
Consumer Products & Services > Travel (0.45)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.92)

Add feedback